
Journal of Clinical Epidemiology

Elsevier BV

All preprints, ranked by how well they match Journal of Clinical Epidemiology's content profile, based on 28 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
AI-Assisted Data Extraction with a Large Language Model: A Study Within Reviews

Gartlehner, G.; Kugley, S.; Crotty, K.; Viswanathan, M.; Dobrescu, A.; Nussbaumer-Streit, B.; Booth, G.; Treadwell, J.; Han, J. M.; Wagner, J.; Apaydin, E.; Coppola, E.; Maglione, M.; Hilscher, R.; Chew, R.; Pilar, M.; Swanton, B.; Kahwati, L.

2025-03-21 · health systems and quality improvement · 10.1101/2025.03.20.25324350 · medRxiv
Top 0.1%
44.4%

Background: Data extraction is a critical but error-prone and labor-intensive task in evidence synthesis. Unlike other artificial intelligence (AI) technologies, large language models (LLMs) do not require labeled training data for data extraction. Objective: To compare an AI-assisted with a traditional, human-only data extraction process. Design: Study within reviews (SWAR) utilizing a prospective, parallel group comparison with blinded data adjudicators. Setting: Workflow validation within six ongoing systematic reviews of interventions under real-world conditions. Intervention: Initial data extraction using an LLM (Claude versions 2.1, 3.0 Opus, and 3.5 Sonnet) verified by a human reviewer. Measurements: Concordance, time on task, accuracy, recall, precision, and error analysis. Results: The six systematic reviews of the SWAR contributed 9,341 data elements, extracted from 63 studies. Concordance between the two methods was 77.2%. The accuracy of the AI-assisted approach compared with enhanced human data extraction was 91.0%, with a recall of 89.4% and a precision of 98.9%. The AI-assisted approach had fewer incorrect extractions (9.0% vs. 11.0%) and a similar risk of major errors (2.5% vs. 2.7%) compared with the traditional human-only method, with a median time saving of 41 minutes per study. Missed data items were the most frequent errors in both approaches. Limitations: Assessing the concordance of data extractions and classifying errors required subjective judgment. Tracking time on task consistently was challenging. Conclusion: The use of an LLM can improve the accuracy of data extraction and save time in evidence synthesis. The results reinforce previous findings that human-only data extraction is prone to errors. Primary Funding Source: US Agency for Healthcare Research and Quality; RTI International. Registration: SWAR28 Gerald Gartlehner (2023 FEB 11 2102).pdf
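
For readers who want the mechanics behind these metrics, here is a minimal sketch (not the authors' code; element names and values are invented) of one common way to score an extraction method against an adjudicated reference:

```python
# Hedged sketch: scoring one extraction method against an adjudicated
# reference, using one common operationalization of accuracy, recall, and
# precision for data-extraction tasks. Element names and values invented.

reference = {"sample_size": "120", "mean_age": "54.2", "dropouts": "7"}
extracted = {"sample_size": "120", "mean_age": "45.2"}  # one wrong, one missed

correct = sum(1 for k, v in reference.items() if extracted.get(k) == v)
attempted = len(extracted)       # elements the method extracted at all
total = len(reference)           # elements in the adjudicated reference

precision = correct / attempted  # of what was extracted, how much is right
recall = correct / total         # of what should be extracted, how much was found
accuracy = correct / total       # here identical to recall; papers may also
                                 # credit items correctly left blank

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```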

2
Estimating the prevalence of discrepancies between study registrations and publications: A systematic review and meta-analyses

TARG Meta-Research Group & Collaborators; Thibault, R. T.; Clark, R.; Pedder, H.; van den Akker, O.; Westwood, S.; Munafo, M.

2021-07-28 · health systems and quality improvement · 10.1101/2021.07.07.21259868 · medRxiv
Top 0.1%
43.1%

Objectives: Prospectively registering study plans in a permanent, time-stamped, and publicly accessible document is becoming more common across disciplines and aims to reduce risk of bias and make risk of bias transparent. Selective reporting persists, however, when researchers deviate from their registered plans without disclosure. This systematic review aimed to estimate the prevalence of undisclosed discrepancies between prospectively registered study plans and their associated publication. We further aimed to identify the research disciplines where these discrepancies have been observed, whether interventions to reduce discrepancies have been conducted, and gaps in the literature. Design: Systematic review and meta-analyses. Data sources: Scopus and Web of Knowledge, published up to 15 December 2019. Eligibility criteria: Articles that included quantitative data about discrepancies between registrations or study protocols and their associated publications. Data extraction and synthesis: Each included article was independently coded by two reviewers using a coding form designed for this review (osf.io/728ys). We used random-effects meta-analyses to synthesize the results. Results: We reviewed k = 89 articles, which included k = 70 that reported on primary outcome discrepancies from n = 6314 studies and k = 22 that reported on secondary outcome discrepancies from n = 1436 studies. Meta-analyses indicated that between 29% and 37% (95% confidence interval) of studies contained at least one primary outcome discrepancy and between 50% and 75% (95% confidence interval) contained at least one secondary outcome discrepancy. Almost all articles assessed clinical literature, and there was considerable heterogeneity. We identified only one article that attempted to correct discrepancies. Conclusions: Many articles did not include information on whether discrepancies were disclosed, which version of a registration they compared publications to, and whether the registration was prospective. Thus, our estimates represent discrepancies broadly, rather than our target of undisclosed discrepancies between prospectively registered study plans and their associated publications. Discrepancies are common and reduce the trustworthiness of medical research. Interventions to reduce discrepancies could prove valuable. Registration: osf.io/ktmdg. Protocol amendments are listed in Supplementary Material A.
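
The abstract does not specify the exact pooling model, but a hedged sketch of a standard random-effects meta-analysis of proportions (DerSimonian-Laird on logit-transformed proportions, with invented study counts) illustrates how such prevalence intervals arise:

```python
# Hedged sketch of random-effects pooling of prevalence estimates; the
# authors' exact model is not given here. Events/totals are invented.
import numpy as np

events = np.array([12, 30, 8, 45])    # studies with >=1 discrepancy, per article
totals = np.array([40, 90, 25, 150])  # studies assessed, per article

p = events / totals
y = np.log(p / (1 - p))                   # logit-transformed proportions
v = 1 / events + 1 / (totals - events)    # approximate within-study variances

w = 1 / v
q = np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - (len(y) - 1)) / c)   # DerSimonian-Laird tau^2

w_star = 1 / (v + tau2)
mu = np.sum(w_star * y) / np.sum(w_star)
se = np.sqrt(1 / np.sum(w_star))
inv_logit = lambda x: 1 / (1 + np.exp(-x))
print(f"pooled prevalence {inv_logit(mu):.0%} "
      f"(95% CI {inv_logit(mu - 1.96 * se):.0%} to {inv_logit(mu + 1.96 * se):.0%})")
```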

3
Impact of the PATH Statement on Analysis and Reporting of Heterogeneity of Treatment Effect in Clinical Trials: A Scoping Review

Selby, J. V.; Maas, C. C. H. M.; Fireman, B.; Kent, D.

2024-05-06 · epidemiology · 10.1101/2024.05.06.24306774 · medRxiv
Top 0.1%
40.4%

Background: The PATH Statement (2020) proposed predictive modeling for examining heterogeneity in treatment effects (HTE) in randomized clinical trials (RCTs). It distinguished risk modeling, which develops a multivariable model predicting individual baseline risk of study outcomes and examines treatment effects across risk strata, from effect modeling, which directly estimates individual treatment effects from models that include treatment, multiple patient characteristics, and interactions of treatment with selected characteristics. Purpose: To identify, describe, and evaluate findings from reports that cite the Statement and present predictive modeling of HTE in RCTs. Data Extraction: We identified reports using PubMed, Google Scholar, Web of Science, and SCOPUS through July 5, 2024. Using double review with adjudication, we assessed consistency with Statement recommendations, credibility of HTE findings (applying criteria adapted from the Instrument to assess Credibility of Effect Modification Analyses (ICEMAN)), and clinical importance of credible findings. Results: We identified 65 reports (presenting 31 risk models and 41 effect models). Contrary to Statement recommendations, only 25 of 48 studies with positive overall findings included a risk model; most effect models included multiple predictors with little prior evidence for HTE. Claims of HTE were noted in 23 risk modeling and 31 effect modeling reports, but risk modeling met credibility criteria more frequently (87% vs 32%). For effect models, external validation of HTE findings was critical in establishing credibility. Credible HTE from either approach was usually judged clinically important (24 of 30). In 19 reports from trials suggesting overall treatment benefits, modeling identified subgroups of 5-67% of patients predicted to experience no benefit or net treatment harm. In five reports from trials that found no overall benefit, subgroups of 25-60% of patients were nevertheless predicted to benefit. Conclusions: Multivariable predictive modeling identified credible, clinically important HTE in one third of 65 reports. Risk modeling found credible HTE more frequently; effect modeling analyses were usually exploratory, but external validation served to increase credibility.
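
As a concrete illustration of the risk-modeling approach the Statement distinguishes (a hedged sketch on simulated data, not the authors' analysis): fit a baseline-risk model that excludes treatment, then estimate the treatment effect within predicted-risk strata.

```python
# Hedged sketch of PATH-style risk modeling on a simulated two-arm trial
# with a binary outcome. Data and variable names are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),
    "age": rng.normal(60, 10, n),
    "diabetes": rng.integers(0, 2, n),
})
lp = -4 + 0.04 * df.age + 0.8 * df.diabetes - 0.5 * df.treat  # true model
df["outcome"] = (rng.random(n) < 1 / (1 + np.exp(-lp))).astype(int)

# Step 1: baseline risk model, blinded to treatment assignment
risk_model = smf.logit("outcome ~ age + diabetes", data=df).fit(disp=0)
df["risk"] = risk_model.predict(df)

# Step 2: treatment effect within predicted-risk quartiles
df["stratum"] = pd.qcut(df["risk"], 4, labels=False)
for s, g in df.groupby("stratum"):
    arr = g.groupby("treat")["outcome"].mean()
    print(f"risk quartile {s}: absolute risk difference = {arr.loc[1] - arr.loc[0]:+.3f}")
```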

4
State of play in individual participant data meta-analyses of randomised trials: Systematic review and consensus-based recommendations

Seidler, A. L.; Aagerup, J.; Nicholson, L.; Hunter, K.; Bajpai, R.; Hamilton, D.; Love, T.; Marlin, N.; Nguyen, D.; Riley, R.; Rydzewska, L.; Simmonds, M.; Stewart, L.; Tam, W.; Tierney, J.; Wang, R.; Amstutz, A.; Briel, M.; Burdett, S.; Ensor, J.; Hattle, M.; Libesman, S.; Liu, Y.; Schandelmaier, S.; Siegel, L.; Snell, K.; Sotiropoulos, J.; Vale, C.; White, I.; Williams, J.; Godolphin, P.

2026-02-04 · epidemiology · 10.64898/2026.02.03.26345481 · medRxiv
Top 0.1%
38.9%

Background: Individual participant data (IPD) meta-analyses obtain, harmonise, and synthesise the raw individual-level data from multiple studies, and are increasingly important in an era of data sharing and personalised medicine to inform clinical practice and policy. Objectives: (1) Describe the landscape of IPD meta-analysis of randomised trials over time; (2) establish current practice in design, conduct, analysis, and reporting for pairwise IPD meta-analysis; and (3) derive recommendations to improve the conduct of and methods for future IPD meta-analyses. Design: Part 1: systematic review of all published IPD meta-analyses of randomised trials; Part 2: in-depth review of current methodological practice for pairwise IPD meta-analysis; and Part 3: adapted nominal group technique to derive consensus recommendations for IPD meta-analysis authors, educators, and methodologists. Data sources: MEDLINE, Embase, and the Cochrane Database of Systematic Reviews (via the Ovid interface). Eligibility criteria: Part 1: all IPD meta-analyses of randomised trials published before February 2024, evaluating intervention effects and based on a systematic search. Part 2: all pairwise IPD meta-analyses from Part 1 published between February 2022 and February 2024. Part 3: a selected panel of experienced IPD meta-analysis authors and/or methodologists. Results: Part 1: We identified 605 eligible IPD meta-analyses published between 1991 and 2024. The number of IPD meta-analyses published per year increased over time until 2019 but has since plateaued at about 60 per year. The most common clinical areas studied were cardiovascular disease (n=113, 19%) and cancer (n=110, 18%). The proportion of IPD meta-analyses published with Cochrane decreased over time, from 16% (n=31/196) before 2015 to 3% (n=5/196) between 2021 and 2024. Part 2: 100 recent pairwise IPD meta-analyses were included in the in-depth review. Most cited PRISMA-IPD (n=68, 68%) and conducted risk of bias assessments (n=82, 82%), with just under half carrying out subgroup analyses not at risk of aggregation bias (n=36/85, 41%). However, only 33% (n=33) and 29% (n=29) respectively provided a protocol or statistical analysis plan, and only 7% (n=6/82) reported using IPD to inform risk of bias assessments. Part 3: 24 experts participated in a consensus workshop. Key recommendations for improved IPD meta-analyses focused on transparency (prospective registration; published protocols and statistical analysis plans) and maximising value (searching trial registries; obtaining IPD for unpublished evidence; using IPD to address missing data and risk of bias). Methodologists and educators should strengthen dissemination of methods and support capacity building across clinical fields and geographical areas. Conclusions: The application and methodological quality of IPD meta-analyses of randomised trials have increased in the last decade, but shortcomings remain. Implementing our consensus-based recommendations will ensure future IPD meta-analyses generate better evidence for clinical decision making. Study registration: Open Science Framework (1).

What is already known on this topic: IPD meta-analyses of randomised trials are regularly used to inform clinical policy and practice. They can provide better quality data and enable more thorough and robust analyses than standard aggregate data meta-analyses, but are resource-intensive and can be challenging to conduct, leading to variable methodological quality. Previous studies that evaluated the conduct of IPD meta-analyses pre-date several major developments, such as the introduction of the PRISMA-IPD reporting guideline.

What this study adds: This is the most comprehensive assessment of IPD meta-analyses of randomised trials to date (605 studies), showing an increase in publications over time followed by a recent plateau. The conduct of IPD meta-analysis has improved in recent years, including increased use of prospective registration, assessment of risk of bias, appropriate analyses of patient subgroup effects, and citing of the PRISMA-IPD statement. Many shortcomings remain, including (i) insufficient pre-specification of methods such as outcomes and analyses; (ii) sub-standard transparency (including publication of protocols, statistical analysis plans, and reporting of analyses); and (iii) failure to gain maximum value from the IPD (i.e., include unpublished trials, use the IPD to inform risk of bias and trustworthiness assessments, and address missing data appropriately). Expert consensus recommendations are provided for how to address these gaps.
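
For orientation, a minimal sketch of the two-stage approach commonly used in pairwise IPD meta-analysis (simulated data and invented variable names; not code from this study): stage one fits the same model within each trial's raw data, stage two pools the trial-level estimates.

```python
# Hedged two-stage IPD meta-analysis sketch on simulated trial data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
trials = []
for t in range(5):
    d = pd.DataFrame({"trial": t, "treat": rng.integers(0, 2, 300)})
    d["y"] = 1.0 - 0.4 * d.treat + rng.normal(0, 1, 300)  # continuous outcome
    trials.append(d)
ipd = pd.concat(trials)

# Stage 1: estimate the treatment effect separately within each trial
est, var = [], []
for t, d in ipd.groupby("trial"):
    fit = smf.ols("y ~ treat", data=d).fit()
    est.append(fit.params["treat"])
    var.append(fit.bse["treat"] ** 2)
est, var = np.array(est), np.array(var)

# Stage 2: inverse-variance pooling (random effects would add a tau^2 term)
w = 1 / var
pooled = np.sum(w * est) / np.sum(w)
se = np.sqrt(1 / np.sum(w))
print(f"pooled effect {pooled:.2f} (95% CI {pooled - 1.96 * se:.2f} to {pooled + 1.96 * se:.2f})")
```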

5
An Empirical Assessment of Inferential Reproducibility of Linear Regression in Health and Biomedical Research Papers

Jones, L.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 · health systems and quality improvement · 10.64898/2026.04.07.26350296 · medRxiv
Top 0.1%
33.4%

Background: In health research, variability in modelling decisions can lead to different conclusions even when the same data are analysed, a challenge known as inferential reproducibility. In linear regression analyses, incorrect handling of key assumptions, such as normality of the residuals and linearity, can undermine reproducibility. This study examines how violations of these assumptions influence inferential conclusions when the same data are reanalysed. Methods: We randomly sampled 95 health-related PLOS ONE papers from 2019 that reported linear regression in their methods. Data were available for 43 papers, and 20 were assessed for computational reproducibility, with three models per paper evaluated. The 14 papers with at least one model that was at least partially computationally reproduced were then examined for inferential reproducibility. To assess the impact of assumption violations, differences in coefficients, 95% confidence intervals, and model fit were compared. Results: Of the 14 papers assessed, only three were inferentially reproducible. The most frequently violated assumptions were normality and independence, each occurring in eight papers. Violations of independence were particularly consequential and were commonly associated with inferential failure. Although reproduced analyses often retained the same binary statistical significance classification as the original studies, confidence intervals were frequently wider, indicating greater uncertainty and reduced precision. Such uncertainty may affect the interpretation of results and, in turn, influence treatment decisions and clinical practice. Conclusion: Our findings demonstrate that substantial violations of key modelling assumptions often went undetected by authors and peer reviewers and, in many cases, were associated with inferential reproducibility failure. This highlights the need for stronger statistical education and greater transparency in modelling decisions. Rather than applying rigid or misinformed rules, such as incorrectly testing the normality of the outcome variable, researchers should adopt modelling frameworks guided by the research question and the study design. When assumptions are violated, appropriate alternatives, such as robust methods, bootstrapping, generalized linear models, or mixed-effects models, should be considered. Given that assumption violations were common even in relatively simple regression models, early and sustained collaboration with statisticians is critical for supporting robust, defensible, and clinically meaningful conclusions.
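
A hedged sketch of the residual-based checks at issue (simulated data; not the authors' protocol), testing normality on the residuals rather than on the outcome variable, as the paper recommends:

```python
# Hedged sketch: checking linear-regression assumptions on the residuals,
# not on the raw outcome. Simulated data; not the study's code.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 2 + 0.5 * df["x"] + rng.normal(0, 1, 200)

fit = smf.ols("y ~ x", data=df).fit()

# Normality: test the residuals (testing y itself is the misapplied rule
# the authors warn about)
print("Shapiro-Wilk on residuals: p =", round(stats.shapiro(fit.resid).pvalue, 3))

# Homoscedasticity: Breusch-Pagan on residuals vs the design matrix
lm, lm_p, fval, f_p = het_breuschpagan(fit.resid, fit.model.exog)
print("Breusch-Pagan: p =", round(lm_p, 3))

# If assumptions fail, one robust fallback is heteroscedasticity-consistent
# (HC3) standard errors rather than abandoning the model outright
robust = fit.get_robustcov_results(cov_type="HC3")
print("HC3 robust standard errors:", robust.bse)
```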

6
Interpretation of wide confidence intervals in meta-analytic estimates: Is the 'Absence of Evidence' 'Evidence of Absence'?

Miller, S. L.; Tuia, J.; Prasad, V.

2023-07-14 · epidemiology · 10.1101/2023.07.11.23292513 · medRxiv
Top 0.1%
32.5%

Introduction: Recently, a Cochrane review by Jefferson et al. on physical interventions to slow the spread of respiratory viruses concluded that "Wearing masks in the community probably makes little or no difference to the outcome of laboratory-confirmed influenza/SARS-CoV-2 compared to not wearing masks", though this finding had a wide confidence interval. Cochrane issued a rare clarifying statement, fueling controversy. We sought to contextualize the findings of the review by Jefferson et al. Methods: We searched for consecutive reviews published by Cochrane on or before March 9th, 2023. We included studies where a central finding showed an intervention offered no statistically significant benefit, and ascertained the language used by reviewers to describe that result. We compared this to the report by Jefferson et al. and deemed it consistent or inconsistent with the language of their report. Results: Between November 21st, 2022, and March 9th, 2023, 20 Cochrane reviews met the inclusion criteria. We found that 95% (n = 19) of the reviews used language that was consistent with Jefferson's findings, while 5% (n = 1) used language inconsistent with Jefferson's conclusion, describing the effect of the intervention on the outcome as "unclear". Discussion: Most reviews performed by Cochrane conclude that interventions which fail to show statistically significant benefits make "no difference", have "no effect", or do not "increase or decrease" the outcome, and this occurs despite wide confidence intervals. The conclusions by Jefferson et al. are consistent with Cochrane reporting guidelines, and clarification from the organization was unjustified.

7
The epidemiology of systematic review updates: a longitudinal study of updating of Cochrane reviews, 2003 to 2018.

Bastian, H.; Doust, J.; Clarke, M.; Glasziou, P.

2019-12-11 · epidemiology · 10.1101/19014134 · medRxiv
Top 0.1%
30.8%

Background: The Cochrane Collaboration has been publishing systematic reviews in the Cochrane Database of Systematic Reviews (CDSR) since 1995, with the intention that these be updated periodically. Objectives: To chart the long-term updating history of a cohort of Cochrane reviews and the impact on the number of included studies. Methods: The status of a cohort of Cochrane reviews updated in 2003 was assessed at three time points: 2003, 2011, and 2018. We assessed their subject scope, compiled their publication history using PubMed and CDSR, and compared them to all Cochrane reviews available in 2002 and 2017/18. Results: Of the 1,532 Cochrane reviews available in 2002, 11.3% were updated in 2003, with 16.6% not updated between 2003 and 2011. The reviews updated in 2003 were not markedly different from other reviews available in 2002, but more were retracted or declared stable by 2011 (13.3% versus 6.3%). The 2003 update led to a major change in the conclusions of 2.8% of updated reviews (n = 177). By 2018, the cohort had a median time since publication of the first full version of the review of 18 years and a median of three updates (range 1-11). The median time to update was three years (range 0-14 years). By the end of 2018, the median time since the last update was seven years (range 0-15). The median number of included studies rose from eight in the version of the review before the 2003 update, to 10 in that update, and 14 in 2018 (range 0-347). Conclusions: Most Cochrane reviews get updated; however, they are becoming more out-of-date over time. Updates have resulted in an overall rise in the number of included studies, although they only rarely lead to major changes in conclusions.

8
Amount and certainty of evidence in Cochrane systematic reviews of interventions: a large-scale meta-research study

Starck, T.; Ravaud, P.; Boutron, I.

2025-12-21 · public and global health · 10.64898/2025.12.19.25342674 · medRxiv
Top 0.1%
29.6%

Objectives: To quantify the amount and certainty of evidence in Cochrane systematic reviews of interventions, and to describe how this evidence has evolved over time. Design: Large-scale meta-research study. Data source: Cochrane Database of Systematic Reviews (search date April 8, 2025). Eligibility criteria: Cochrane systematic reviews assessing interventions and reporting "Summary of findings" tables. Data extraction: Data were automatically extracted using web scraping and a large language model, with quality control performed by humans on a random sample. Analysis: We describe the certainty of evidence for each population-intervention-comparison-outcome (PICO) question reported in all Cochrane "Summary of findings" tables. When available, we compared the certainty of evidence between the initial version and the latest update. Results: We identified 5,116 reviews that reported a "Summary of findings" table, containing 64,849 PICO questions. Overall, 24% (n = 15,768) of PICOs had no study included, 31% (n = 20,390) included only 1 study, 14% (n = 8,796) 2 studies, and 31% (n = 19,895) more than 2 studies. Nearly all PICOs (97%) included only randomized trials. The median [Q1-Q3] number of included participants was 123 [0-557]. The certainty of evidence was rated as high for 4% (n = 2,852), moderate for 16% (n = 10,574), low for 27% (n = 17,409), very low for 26% (n = 17,012), and not assessed for 26% (n = 17,002). Of the 7,461 PICO questions with an update (median time to update of 4.3 years [Q1-Q3: 2.6-6.4]), the number of included studies in the latest update remained the same for 63%; the certainty of evidence was unchanged for 71%, upgraded for 13%, and downgraded for 15%. Conclusion: The amount and certainty of evidence is low and has not improved over time with review updates. These results question the efficiency of the research ecosystem.

What is already known on this topic: High-quality, up-to-date evidence synthesis is essential for decision-makers. Confidence in the evidence informing decision-making can be limited by the amount and quality of primary research on a specific research question.

What this study adds: This large-scale meta-research study analyzed all Cochrane "Summary of findings" tables (i.e., 64,849 population-intervention-comparison-outcome (PICO) questions) and found that about two thirds of the PICO questions were informed by two or fewer studies, with a median [Q1-Q3] of 123 [0-557] participants per PICO; the associated certainty of evidence was rated as high in only 4% of cases. After an update of the review (i.e., 7,461 PICOs), 63% of PICOs did not include additional studies, and 71% showed no change in certainty of evidence; upgrades and downgrades of certainty occurred at similar frequencies. These results question the efficiency of the research ecosystem.

9
Systematic review search strategies are poorly described and not reproducible: a cross-sectional meta-research study

Rethlefsen, M. L.; Brigham, T. J.; Price, C.; Moher, D.; Bouter, L. M.; Kirkham, J. J.; Schroter, S.; Zeegers, M. P.

2023-05-16 · epidemiology · 10.1101/2023.05.11.23289873 · medRxiv
Top 0.1%
29.4%

Objective: To determine the reproducibility of biomedical systematic review search strategies. Design: Cross-sectional meta-research study. Population: Random sample of 100 systematic reviews indexed in MEDLINE in November 2021. Main Outcome Measures: The primary outcome measure is the percentage of systematic reviews for which all database searches can be reproduced. This was operationalized as fulfilling six key PRISMA-S reporting guideline items (database name, multi-database searching, full search strategies, limits and restrictions, date(s) of searches, and total records) and having all database searches reproduced within 10% of the number of original results. Results: The 100 systematic review articles contained 453 database searches. Of those, 214 (47.2%) provided complete database information (named the database and platform; PRISMA-S item 1). Only 22 (4.9%) database searches reported all six PRISMA-S items. Forty-seven (10.4%) database searches could be reproduced within 10% of the number of results from the original search; 6 searches differed by more than 1000% between the originally reported number of results and the reproduction. Only one systematic review article provided the necessary details for all database searches to be fully reproducible. Conclusion: Systematic review search reporting is poor. As systematic reviews and the clinical practice guidelines based upon them continue to proliferate, so does research waste. Correcting this will require a multi-faceted response from systematic review authors, peer reviewers, journal editors, and database providers.

10
Estimating the replicability of highly cited clinical research (2004-2018)

da Costa, G. G.; Neves, K.; Amaral, O. B.

2022-05-31 · epidemiology · 10.1101/2022.05.31.22275810 · medRxiv
Top 0.1%
28.5%

Introduction: Previous studies of the replicability of clinical research based on the published literature have suggested that highly cited articles are often contradicted or found to have inflated effects. Nevertheless, there are no recent updates of such efforts, and this situation may have changed over time. Methods: We searched the Web of Science database for articles studying medical interventions with more than 2000 citations, published between 2004 and 2018 in high-impact medical journals. We then searched for replications of these studies in PubMed using the PICO (Population, Intervention, Comparator and Outcome) framework. Replication success was evaluated by the presence of a statistically significant effect in the same direction and by overlap of the replication's effect size confidence interval (CI) with that of the original study. Evidence of effect size inflation and potential predictors of replicability were also analyzed. Results: A total of 89 eligible studies were found, of which 24 had valid replications (17 meta-analyses and 7 primary studies). Of these, 21 (88%) had effect sizes with overlapping CIs. Of 15 highly cited studies with a statistically significant difference in the primary outcome, 13 (87%) had a significant effect in the replication as well. When both criteria were considered together, the replicability rate in our sample was 20 out of 24 (83%). There was no evidence of systematic inflation in these highly cited studies, with a mean effect size ratio of 1.03 (95% CI [0.88, 1.21]) between initial and subsequent effects. Due to the small number of contradicted results, our analysis had low statistical power to detect predictors of replicability. Conclusion: Although most studies did not have eligible replications, the replicability rate of highly cited clinical studies in our sample was higher than in previous estimates, with little evidence of systematic effect size inflation.
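
A minimal sketch of the two replication criteria described above (invented effect sizes on the log scale; how non-significant originals are scored below is an assumption, not stated in the abstract):

```python
# Hedged sketch of the two replication criteria named in the abstract.
# Effect sizes are invented log risk ratios with 95% CIs.
from dataclasses import dataclass

@dataclass
class Effect:
    est: float  # point estimate on the log scale
    lo: float   # lower 95% CI bound
    hi: float   # upper 95% CI bound

    @property
    def significant(self) -> bool:
        return self.lo > 0 or self.hi < 0  # CI excludes the null (0)

def replicates(original: Effect, replication: Effect) -> dict:
    same_direction = (original.est > 0) == (replication.est > 0)
    # Assumption: a non-significant original trivially passes this criterion
    sig_criterion = (not original.significant) or (
        replication.significant and same_direction)
    ci_overlap = original.lo <= replication.hi and replication.lo <= original.hi
    return {"significance criterion": sig_criterion, "CI overlap": ci_overlap}

print(replicates(Effect(-0.40, -0.70, -0.10), Effect(-0.25, -0.50, 0.00)))
# -> {'significance criterion': False, 'CI overlap': True}
```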

11
Challenges in the Computational Reproducibility of Linear Regression Analyses: An Empirical Study

Jones, L. V.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 · health systems and quality improvement · 10.64898/2026.04.07.26350286 · medRxiv
Top 0.1%
28.4%

Background: Reproducibility concerns in health research have grown, as many published results fail to be independently reproduced. Achieving computational reproducibility, where others can replicate the same results using the same methods, requires transparent reporting of statistical tests, models, and software use. While data-sharing initiatives have improved accessibility, the actual usability of shared data for reproducing research findings remains underexplored. Addressing this gap is crucial for advancing open science and ensuring that shared data meaningfully support reproducibility and enable collaboration, thereby strengthening evidence-based policy and practice. Methods: A random sample of 95 PLOS ONE health research papers from 2019 reporting linear regression was assessed for data-sharing practices and computational reproducibility. Data were accessible for 43 papers. From the randomly selected sample, the first 20 papers with available data were assessed for computational reproducibility. Three regression models per paper were reanalysed. Results: Of the 95 papers, 68 reported having data available, but 25 of these lacked the data required to reproduce the linear regression models. Only eight of 20 papers we analysed were computationally reproducible. A major barrier to reproducing the analyses was the great difficulty in matching the variables described in the paper to those in the data. Papers sometimes failed to be reproduced because the methods were not adequately described, including variable adjustments and data exclusions. Conclusion: More than half (60%) of analysed studies were not computationally reproducible, raising concerns about the credibility of the reported results and highlighting the need for greater transparency and rigour in research reporting. When data are made available, authors should provide a corresponding data dictionary with variable labels that match those used in the paper. Analysis code, model specifications, and any supporting materials detailing the steps required to reproduce the results should be deposited in a publicly accessible repository or included as supplementary files. To increase the reproducibility of statistical results, we propose a Model Location and Specification Table (MLast), which tracks where and what analyses were performed. In conjunction with a data dictionary, MLast enables the mapping of analyses, greatly aiding computational reproducibility.

12
Research waste from poor reporting of core methods and results and redundancy in studies of reporting guideline adherence: a meta-research review

Dal Santo, T.; Rice, D. B.; Amiri, L. S.; Tasleem, A.; Li, K.; Boruff, J. T.; Geoffroy, M.-C. B.; Benedetti, A.; Thombs, B.

2022-12-20 · epidemiology · 10.1101/2022.12.19.22283669 · medRxiv
Top 0.1%
27.7%

Objectives: We investigated meta-research studies that evaluated adherence to prominent reporting guidelines (CONSORT, PRISMA, STARD, STROBE) in health research studies to determine the proportion that (1) provided an explanation for how complex guideline items were rated for adherence and (2) provided results from individual studies reviewed in addition to aggregate results. We also examined the conclusions of each meta-research study to assess redundancy of findings across studies. Design: Cross-sectional meta-research review. Data sources: MEDLINE (Ovid), searched on July 5, 2022. Eligibility criteria for selecting studies: Studies in any language were eligible if they used any version of the CONSORT, PRISMA, STARD, or STROBE reporting guidelines or their extensions to evaluate reporting in at least 10 human health research studies. We excluded studies that modified a reporting guideline or its items or evaluated fewer than half of reporting guideline items. Main outcomes were (1) the proportion of meta-research studies that provided a coding explanation that could be used to replicate the study or verify its results and (2) the proportion that provided individual-level study results in the main text, supplemental materials, or via an internet link. Results: Of 148 included meta-research studies, 14 (10%, 95% confidence interval [CI] 6% to 15%) provided a fully replicable coding explanation, and 49 (33%, 95% CI 26% to 41%) completely reported individual study results. Of 90 studies that classified reporting as adequate or inadequate in the study abstract, 6 (7%, 95% CI 3% to 14%) concluded that reporting was adequate, but none of those 6 studies provided information on how items were coded or provided item-level results for included studies. Conclusions: Much of published meta-research on reporting in health research is likely wasteful. Few studies report enough information for verification or replication, and almost all find that reporting in health research studies is suboptimal. These findings highlight the importance of shifting the focus from assessing reporting adequacy to developing, testing, and implementing strategies to improve reporting. Funding: There was no specific funding for this study. Protocol: Posted on the Open Science Framework June 29, 2022 (https://osf.io/gtm4z/).

13
Automation of Systematic Reviews with Large Language Models

Cao, C.; Arora, R.; Cento, P.; Manta, K.; Farahani, E.; Cecere, M.; Selemon, A.; Sang, J.; Gong, L. X.; Kloosterman, R.; Jiang, S.; Saleh, R.; Margalik, D.; Lin, J.; Jomy, J.; Xie, J.; Chen, D.; Gorla, J.; Lee, S.; Zhang, K.; Ware, H.; Whelan, M. G.; Teja, B.; Leung, A. A.; Ghosn, L.; Arora, R. K.; Noetel, M.; Emerson, D. B.; Boutron, I.; Moher, D.; Church, G. M.; Bobrovitz, N.

2025-06-13 · health informatics · 10.1101/2025.06.13.25329541 · medRxiv
Top 0.1%
26.8%

Importance: Systematic reviews (SRs) inform evidence-based decision making. Yet many take over a year to complete, are labor intensive, are prone to human error, and face reproducibility challenges, limiting access to timely and reliable information. Objective: To validate a large language model (LLM)-based workflow (otto-SR) to automate three of the most labour-intensive tasks in performing SRs: article screening, data extraction, and risk of bias assessment; and to assess its feasibility in rapidly updating existing reviews. Design, setting, and participants: We conducted a validation study in four phases, with direct benchmarking against graduate-level human researchers in phases 1 and 2. Phase 1: article screening performance was measured across 32,357 citations from 5 systematic reviews. The reference standard consisted of the original reviews' screening decisions after full-text screening. Phase 2: data extraction performance was measured across 4,495 data points from 495 studies in 7 reviews. Phase 3: risk of bias assessment (ROB2, Newcastle-Ottawa, QUADAS2) performance was measured across 345 studies from 12 reviews. Reference standards for Phases 2 and 3 were created after blinded adjudication of the original review extraction and RoB assessments. Phase 4: otto-SR was used to reproduce and update the primary analysis from an issue of Cochrane reviews (n=12 reviews, 146,276 citations), with analytical comparisons to the original meta-analyzed findings. All discrepancies underwent dual human review. Results: otto-SR showed high performance in phase 1 article screening (otto-SR: 96.7% sensitivity, 97.9% specificity; human: 81.7% sensitivity, 98.1% specificity) and phase 2 data extraction (otto-SR: 93.1% accuracy; human: 79.7% accuracy). In phase 3, otto-SR demonstrated high interrater reliability for risk of bias judgements (ROB2 0.98, Newcastle-Ottawa 0.95, QUADAS2 0.74; Gwet's AC2). In phase 4, otto-SR reproduced and updated the primary analysis from an issue of Cochrane reviews. Across Cochrane reviews, otto-SR incorrectly excluded a median of 0 studies (IQR 0 to 0.25) and found nearly twice as many eligible studies as the original authors (n = 114 vs. 64). Meta-analyses based on otto-SR-generated screening and extraction outputs, subsequently verified through dual human review, yielded newly statistically significant effect estimates in 2 reviews and negated significance in 1 review. Conclusions and relevance: LLMs have high performance in article screening, data extraction, and risk of bias assessments. They can rapidly reproduce and update existing systematic reviews, laying the foundation for automated, scalable, and reliable evidence synthesis.

14
An evaluation of reproducibility and errors in published sample size calculations performed using G*Power

Thibault, R. T.; Zavalis, E. A.; Malicki, M.; Pedder, H.

2024-08-05 · health systems and quality improvement · 10.1101/2024.07.15.24310458 · medRxiv
Top 0.1%
26.5%

Background: Published studies in the life and health sciences often employ sample sizes that are too small to detect realistic effect sizes. This shortcoming increases the rate of false positives and false negatives, giving rise to a potentially misleading scientific record. To address this shortcoming, many researchers now use point-and-click software to run sample size calculations. Objective: We aimed to (1) estimate how many published articles report using the G*Power sample size calculation software; (2) assess whether these calculations are reproducible and (3) error-free; and (4) assess how often these calculations use G*Power's default option for mixed-design ANOVAs, which can be misleading and output sample sizes that are too small for a researcher's intended purpose. Method: We randomly sampled open access articles from PubMed Central published between 2017 and 2022 and used a coding form to manually assess 95 sample size calculations for reproducibility and errors. Results: We estimate that more than 48,000 articles published between 2017 and 2022 and indexed in PubMed Central or PubMed report using G*Power (i.e., 0.65% [95% CI: 0.62% - 0.67%] of articles). We could reproduce 2% (2/95) of the sample size calculations without making any assumptions, and could likely reproduce another 28% (27/95) after making assumptions. Many calculations were not reported transparently enough to assess whether an error was present (75%; 71/95) or whether the sample size calculation was for a statistical test that appeared in the results section of the publication (48%; 46/95). Few articles that performed a calculation for a mixed-design ANOVA unambiguously selected the non-default option (8%; 3/36). Conclusion: Published sample size calculations that use G*Power are not transparently reported and may not be well-informed. Given the popularity of software packages like G*Power, they present an intervention point to increase the prevalence of informative sample size calculations.
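
Reproducing a reported calculation requires the test family, effect size, alpha, power, and design details. A hedged sketch of checking one hypothetical reported calculation (a two-sample t-test; the numbers are invented, not from the paper) with statsmodels:

```python
# Hedged sketch: reproducing a hypothetical reported G*Power calculation
# ("two-sample t-test, d = 0.5, alpha = .05, power = .80, two-tailed").
# The reported numbers are invented for illustration.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,            # Cohen's d, as reported
    alpha=0.05,
    power=0.80,
    alternative="two-sided",
)
print(f"required n per group: {n_per_group:.1f}")  # ~63.8, i.e. 64 per group

# The calculation is reproducible if 2 * 64 matches the paper's reported
# total sample size; a mismatch flags missing assumptions or an error.
```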

15
Enough evidence and other endings: a descriptive study of stable Cochrane systematic reviews in 2019.

Bastian, H.; Hemkens, L. G.

2019-12-09 · epidemiology · 10.1101/19013912 · medRxiv
Top 0.1%
24.8%

Background: From 2006 to 2019, Cochrane reviews could be designated "stable" if they were not being updated but were highly likely to be current. This provides an opportunity to observe practice in ending systematic reviewing and what is regarded as enough evidence. Methods: We identified Cochrane reviews designated stable in 2013 and 2019 and the reasons for this designation. For those with conclusions stated to be so firm that new evidence is unlikely to change them, we assessed conclusions, strength of evidence ratings, and recommendations for further research. We assessed the fate of the 2013 stable reviews. We also estimated usage of formal analytic methods to determine when there is enough evidence in protocols for Cochrane reviews. Results: Cochrane reviews were rarely designated stable. In 2019, there were 507 stable Cochrane reviews (6.6% of 7,645 non-withdrawn reviews). The most common reasons related to no, little, or infrequent research activity being expected (331 of 505; 65.5%). Only 39 reviews were stable because of firm conclusions unlikely to be changed by new evidence (7.7%), but that declaration was mostly not supported by judgments made in the review about strength of evidence and implications for research. Among the 180 reviews stable in 2013, 16 reverted to normal status (8.9%), with 2 of those changing conclusions because of new studies. Few Cochrane protocols specified an analytic method for determining when there was enough evidence to stop updating the review (116 of 2,415; 4.8%). Conclusion: Cochrane reviews were more likely to end because important future primary research activity was believed to be unlikely than because there was enough evidence. Judgments about the strength of evidence and need for research were often inconsistent with the declaration that conclusions were unlikely to change. The inconsistencies underscore the need for reliable analytic methods to support decision-making about the conclusiveness of evidence.

16
How often is the core outcome set for low back pain used in clinical trials? A protocol for a meta-epidemiological study

Innocenti, T.; Salvioli, S.; Logullo, P.; Giagio, S.; Ostelo, R.; Chiarotto, A.

2023-01-11 · epidemiology · 10.1101/2023.01.11.23284425 · medRxiv
Top 0.1%
23.9%

Background: Non-specific low back pain (NSLBP) is the worldwide leading cause of disability, accounting for large costs for healthcare systems and work productivity. Many treatment options are available for patients with NSLBP. Authors of systematic reviews on LBP report that outcomes are often measured and reported inconsistently. This inconsistency limits the comparison of findings among trials, and it can be due to selective outcome reporting bias (e.g. reporting only outcomes with positive results in a publication), which strongly affects the conclusions of systematic reviews. Recommendations for standardised reporting of outcome measurement instruments in clinical studies were initially published in 1998 and updated through an international consensus Delphi study by Chiarotto and colleagues in 2015. This updated Core Outcome Set (COS) for NSLBP included the following core outcome domains: "physical functioning", "pain intensity", "health-related quality of life", and "number of deaths". With the exception of "number of deaths", the other three core domains were already included in the core set published in 1998 by Deyo et al. In 2018, another international consensus of Chiarotto et al. formulated recommendations on which core outcome measurement instruments (Core Outcome Measurement Set - COMS) should be used in NSLBP trials. A consensus was reached on the Numeric Rating Scale (NRS) for "pain intensity", the Oswestry Disability Index (ODI) or Roland-Morris Disability Questionnaire (RMDQ-24) for "physical functioning", and the Short Form Health Survey 12 (SF12) or 10-item PROMIS Global Health (PROMIS-GH-10) for "HRQOL". Therefore, the recommended COS has been in the public domain for more than 20 years. However, it is still unknown whether it has changed the selection of outcomes used in NSLBP trials during this period. Objectives: (1) To assess the uptake of the COS for NSLBP in clinical trials; (2) to assess the uptake of the Core Outcome Measurement Set for NSLBP in clinical trials; (3) to analyse whether specific study characteristics (year of registration, sample size, country of origin, duration of follow-up, phase of the trial, intervention, and source of funding) are associated with COS uptake. Methods: We will adopt Kirkham et al.'s recommendations on the assessment of COS uptake. We will search the World Health Organization (WHO) International Clinical Trials Registry Platform (ICTRP) and the ClinicalTrials.gov registry to identify potentially eligible trial protocols. Two reviewers (TI and SG) will select potentially eligible entries and evaluate whether they meet the eligibility criteria. A consensus meeting will be held to determine agreement on the selection; in case of disagreement, a third reviewer (SS) will decide on inclusion. We will calculate the percentage of clinical trials that planned to measure data on the full NSLBP COS, as well as the percentage of trials measuring the full COS per year. We will calculate the percentage of the NSLBP core outcome measurement instruments used for each domain described in the COS.
Lastly, we will perform a multivariable logistic regression analysis to assess the relationship between full COS uptake (yes/no) as the dependent variable and the following independent variables: year of registration, sample size, country of origin, duration of follow-up interval, phase of the trial (III or IV), intervention (pharmacological vs non-pharmacological trial), and source of funding (commercial vs non-commercial vs no funding). Ethics and dissemination: A manuscript will be prepared and submitted for publication in an appropriate peer-reviewed journal upon study completion. We believe that the results of this investigation will be relevant to researchers, encouraging more attention to evidence synthesis and the translation of clinical implications to key stakeholders (healthcare providers and patients).
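
A minimal, hedged sketch of the planned multivariable logistic regression (simulated data; the coding of covariates is invented for illustration):

```python
# Hedged sketch of the planned analysis: logistic regression of full COS
# uptake on trial characteristics. Simulated data; covariate coding invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 500
trials = pd.DataFrame({
    "cos_uptake": rng.integers(0, 2, n),      # full COS measured: yes/no
    "year": rng.integers(2000, 2023, n),      # year of registration
    "sample_size": rng.integers(30, 1000, n),
    "phase": rng.choice(["III", "IV"], n),
    "intervention": rng.choice(["pharmacological", "non-pharmacological"], n),
    "funding": rng.choice(["commercial", "non-commercial", "none"], n),
})

model = smf.logit(
    "cos_uptake ~ year + sample_size + C(phase) + C(intervention) + C(funding)",
    data=trials,
).fit(disp=0)
print(np.exp(model.params))  # odds ratios for each covariate
```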

17
Sensitivity, specificity and avoidable workload of using a large language model for title and abstract screening in systematic reviews and meta-analyses

Tran, V.-T.; Gartlehner, G.; Yaacoub, S.; Boutron, I.; Schwingshackl, L.; Stadelmaier, J.; Sommer, I.; Aboulayeh, F.; Afach, S.; Meerpohl, J.; Ravaud, P.

2023-12-17 · epidemiology · 10.1101/2023.12.15.23300018 · medRxiv
Top 0.1%
23.9%

Importance: Systematic reviews are time-consuming and are still performed predominantly manually by researchers despite the exponential growth of scientific literature. Objective: To investigate the sensitivity and specificity, and to estimate the avoidable workload, when using an AI-based large language model (LLM) (Generative Pre-trained Transformer [GPT] version 3.5-Turbo from OpenAI) to perform title and abstract screening in systematic reviews. Data Sources: Unannotated bibliographic databases from five systematic reviews conducted by researchers from Cochrane Austria, Germany, and France, all published after January 2022 and hence not in the training data set of GPT 3.5-Turbo. Design: We developed a set of prompts for GPT models aimed at mimicking the process of title and abstract screening by human researchers. We compared recommendations from the LLM to rule out citations based on title and abstract with decisions from authors, with a systematic reappraisal of all discrepancies between the LLM and their original decisions. We used bivariate models for meta-analyses of diagnostic accuracy to estimate pooled estimates of sensitivity and specificity. We performed a simulation to assess the avoidable workload from limiting human screening on title and abstract to citations which were not "ruled out" by the LLM in a random sample of 100 systematic reviews published between 01/07/2022 and 31/12/2022. We extrapolated estimates of avoidable workload for health-related systematic reviews assessing therapeutic interventions in humans published per year. Results: Performance of GPT models was tested across 22,666 citations. Pooled estimates of sensitivity and specificity were 97.1% (95% CI 89.6% to 99.2%) and 37.7% (95% CI 18.4% to 61.9%), respectively. In 2022, we estimated the workload of title and abstract screening for systematic reviews to range from 211,013 to 422,025 person-hours. Limiting human screening to citations which were not "ruled out" by GPT models could reduce workload by 65% and save from 106,268 to 276,053 person work-hours (i.e., 66 to 172 person-years of work) every year. Conclusions and Relevance: AI systems based on large language models provide highly sensitive and moderately specific recommendations to rule out citations during title and abstract screening in systematic reviews. Their use to "triage" citations before human assessment could reduce the workload of evidence synthesis.
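
The triage arithmetic can be made explicit. A hedged sketch using the pooled estimates above plus an assumed prevalence of relevant citations (the paper's simulation used per-review rule-out rates, so this does not reproduce its 65% figure):

```python
# Hedged sketch of the triage arithmetic: the human screening workload
# avoided equals the share of citations the LLM "rules out". Sensitivity
# and specificity are the pooled values from the abstract; the prevalence
# of truly relevant citations is an assumption for illustration.
sens = 0.971   # pooled sensitivity
spec = 0.377   # pooled specificity
prev = 0.03    # assumed share of citations that are truly relevant

ruled_out = (1 - prev) * spec + prev * (1 - sens)  # citations humans skip
missed = prev * (1 - sens)                         # relevant citations lost
print(f"workload avoided: {ruled_out:.1%}; relevant citations missed: {missed:.2%}")
```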

18
Time-to-retraction and likelihood of evidence contamination (VITALITY Extension I): a retrospective cohort analysis

Yuan, Y.; Peng, Z.; Doi, S. A. R.; Furuya-Kanamori, L.; Cao, H.; Lin, L.; Chu, H.; Loke, Y.; Mol, B. W.; Golder, S.; Vohra, S.; Xu, C.

2026-02-24 · epidemiology · 10.64898/2026.02.20.26346631 · medRxiv
Top 0.1%
23.7%

Background: The number of problematic randomized clinical trials (RCTs) has risen sharply in recent decades, posing serious challenges to the integrity of the healthcare evidence ecosystem. Objective: To investigate whether retraction of problematic RCTs could reduce evidence contamination. Design: Retrospective cohort study. Setting: A secondary analysis of the VITALITY Study database. Participants: 1,330 retracted RCTs with 847 systematic reviews. Measurements: The difference in the median number (and its interquartile range, IQR) of contaminations before and after retraction, and the association between time-to-retraction and the likelihood of evidence contamination. Results: Among these retracted RCTs, 426 led to evidence contamination, resulting in 1,106 contamination events (251 after retraction vs. 855 before retraction). The time interval between RCT publication and first contamination ranged from 0.2 to 30.9 years, with a median of 3.3 years (95% CI: 3.0 to 3.9). The median number of contaminated systematic reviews was lower after retraction than before retraction (0, IQR: 0 to 1 vs. 1, IQR: 1 to 2, P < 0.01). Compared with trials retracted more than 7.5 years after publication, those retracted between 1.0 and 1.8 years (OR = 0.70, 95% CI: 0.60 to 0.80) and those retracted within 1.0 year (OR = 0.69, 95% CI: 0.60 to 0.80) were associated with a lower likelihood of evidence contamination. Limitations: We only assessed contaminated systematic reviews with quantitative synthesis, and the analysis was limited to retracted RCTs. Conclusions: Retracting problematic RCTs can significantly reduce evidence contamination, and faster retraction was associated with less contamination. To safeguard the integrity of the evidence ecosystem, academic journals should act promptly in retracting problematic studies to minimize their downstream impact. Primary Funding Sources: The National Natural Science Foundation of China (72204003, 72574229).

19
Retracted randomized trials attributed to super-retractors and top-cited scientists with multiple retractions: secondary analysis of the VITALITY retrospective cohort

Lyu, C.; Matbouriahi, M.; Naudet, F.; Ioannidis, J. P. A.; Cristea, I. A.

2025-11-25 · epidemiology · 10.1101/2025.11.23.25340834 · medRxiv
Top 0.1%
23.4%

Importance: Multiple retractions from the same author often uncover issues affecting their entire work, such as systematically altered or fabricated data. Objectives: To evaluate the contribution of authors with the most retractions ("super-retractors") and of top-cited scientists with multiple retractions to the retracted clinical trial literature. Design: Retrospective cohort study, linking an openly available cohort (VITALITY) of 1330 retracted randomized clinical trials (RCTs) to three lists of scientists: super-retractors, totaling the most retractions in the Retraction Watch Leaderboard, and top-cited scientists, over the entire career or in the most recent single year, who accumulated 10 or more retractions not due to editor/publisher errors. The VITALITY cohort was updated up to November 2024. The three author lists were updated in August 2025. Participants: 30 super-retractors, 163 career-long and 174 single-year scientists totaling 10 or more retractions. Main outcomes: Authorship and characteristics of retracted RCTs (publication and retraction year, time between publication and retraction, number of citations). Results: 6/30 super-retractors, representing Anesthesiology and Endocrinology & Metabolism, co-authored 290/1330 retracted RCTs (22%). 18/163 career-long top-cited scientists with at least 10 retractions, representing 10 fields, co-authored 327/1330 trials (25%), 275 (84%) of which were also co-authored by a super-retractor. 7/174 single-year top-cited scientists with at least 10 retractions co-authored 50 retracted trials; all of them were also among the career-long top-cited scientists with at least 10 retractions. Articles with super-retractor authors, versus those without, were published earlier (median (IQR) = 2000 (1997-2005) vs 2020 (2014-2022)); were retracted earlier (median (IQR) = 2013 (2012-2019) vs 2023 (2018.5-2023)); had a longer lag between publication and retraction (median (IQR) = 5111 (3560-6820) vs 482 (330-1119) days); and accrued more citations (median (IQR) = 21 (12-42) vs 5 (1-19)). In multivariable regression models, only time to retraction (β = 0.02, P < 0.001) was significantly and positively associated with total citations. Results were similar when comparing retracted articles from top-cited scientists with at least 10 retractions versus other articles. Conclusions and relevance: In this cohort study of 1330 retracted RCTs, a small number of influential authors, often co-authors of one another and concentrated in a few fields of medicine and countries, account for a significant proportion of retracted clinical trials.

Key points: Question: What is the contribution of the authors with the most retractions ("super-retractors") and of those top-cited with multiple retractions to the retracted randomized clinical trials literature? Findings: In this cohort study, six super-retractors, from Anesthesiology and Endocrinology & Metabolism, co-authored one fifth of all retracted trials, while 18 top-cited scientists with over 10 retractions co-authored a quarter of them. Articles co-authored by super-retractors or by top-cited scientists with multiple retractions were published and retracted earlier, took longer to retract, and accumulated more citations. Meaning: Retracted clinical trials are disproportionately associated with a small number of influential authors, often co-authors and concentrated in a few subfields of medicine and countries.

20
Agreeability testing of AMSTAR-PF, a tool for quality appraisal of systematic reviews of prognostic factor studies

Henry, M.; O'Connell, N.; Riley, R.; Moons, K.; Shea, B.; Hooft, L.; Wallwork, S.; Damen, J.; Skoetz, N.; Appiah, R.; Berryman, C.; Crouch, S.; Ferencz, G.; Grant, A.; Henry, K.; Herman, A.; Karran, E.; Koralegedera, I.; Leake, H.; MacIntyre, E.; Mouatt, B.; Phuentsho, K.; Van Der Laan, D.; Welsby, E.; Wiles, L.; Wilkinson, E.; Wilson, M.; Wilson, M.; Moseley, L.

2025-04-14 · epidemiology · 10.1101/2025.04.10.25325555 · medRxiv
Top 0.1%
23.2%

Background: This paper details initial testing of the agreeability and usability of a novel quality appraisal tool for systematic reviews of prognostic factor studies: AMSTAR-PF. Methods: Fourteen appraisers each assessed eight systematic reviews using AMSTAR-PF. Their ratings for each question and each article were compared, with interrater, inter-pair, and intrapair agreeability calculated using Gwet's agreement coefficient. Time of use and time to reach consensus were also recorded. Results: Across the domains, interrater agreement averaged 0.59 (range 0.21-0.90), inter-pair 0.61 (range 0.24-0.91), and intrapair 0.75 (range 0.45-0.95), with agreement for the overall rating 0.46 (95% CI 0.30-0.62) for interrater, 0.46 (95% CI 0.17-0.74) for inter-pair, and 0.68 (range of averages 0.22-1.00) for intrapair agreement. The majority (60.7%) of intrapair ratings were identical, with 94.6% of final ratings either identical or only one category different for the overall appraisal. The time taken to appraise a study with AMSTAR-PF improved with use and averaged around 34 minutes after the first two appraisals. Conclusions: Despite some variance in agreeability across domains and between appraisers, the testing results suggest that AMSTAR-PF has clear utility for appraising the quality of systematic reviews of prognostic factor studies.
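
For readers unfamiliar with the statistic, a hedged sketch of Gwet's AC1 for two raters (the unweighted form; the study may have used a weighted variant for ordinal ratings, and the ratings below are invented):

```python
# Hedged sketch of Gwet's AC1 for two raters on invented appraisal ratings.
from collections import Counter

rater1 = ["yes", "yes", "no", "partial", "yes", "no", "yes", "partial"]
rater2 = ["yes", "no", "no", "partial", "yes", "no", "partial", "partial"]
n = len(rater1)
cats = sorted(set(rater1) | set(rater2))
K = len(cats)

# Observed agreement: proportion of items rated identically
pa = sum(a == b for a, b in zip(rater1, rater2)) / n

# Gwet's chance agreement: based on the mean prevalence pi_k of each category
counts = Counter(rater1) + Counter(rater2)
pe = sum((counts[c] / (2 * n)) * (1 - counts[c] / (2 * n)) for c in cats) / (K - 1)

ac1 = (pa - pe) / (1 - pe)
print(f"observed {pa:.2f}, chance {pe:.2f}, Gwet's AC1 = {ac1:.2f}")
```

Unlike Cohen's kappa, AC1's chance-agreement term stays stable when category prevalences are skewed, which is one reason it is often preferred for appraisal-tool agreement studies.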